Methods and apparatus to generate anomaly detection datasets

ABSTRACT

Example methods and apparatus to generate anomaly detection datasets are disclosed. An example method to generate an anomaly detection dataset for training a machine learning model to detect real world anomalies includes receiving a user definition of an anomaly generator function, executing, with a processor, the anomaly generator function to generate user-defined anomaly data, and combining the user-defined anomaly data with nominal data to generate the anomaly detection dataset.

RELATED APPLICATION

This patent arises from a continuation of U.S. patent application Ser. No. 15/591,886, filed May 10, 2017. U.S. patent application Ser. No. 15/591,886 is hereby incorporated herein by reference in its entirety.

FIELD OF THE DISCLOSURE

The present disclosure relates generally to anomaly detection and, more particularly, to methods and apparatus to generate anomaly detection datasets.

BACKGROUND

Machine learning is a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on systems that can change when exposed to new data. Interest in machine learning is increasing due to a combination of advances in computing, big data management, and machine learning algorithms, among other things. Anomaly detection is the process of identifying outliers in the inputs for a problem domain (e.g., MRI image interpretation). Example anomalies include, but are not limited to, a tumor in an MRI image, a fraudulent credit card transaction, a failure in a mechanical part or control system, etc. Some example impacts of anomalies include, but are not limited to, monetary losses, property damage, loss of life, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example machine learning system having a dataset generator, in accordance with this disclosure.

FIG. 2 is a block diagram illustrating an example implementation for the example dataset generator of FIG. 1.

FIG. 3 is a listing of source code that may be executed to implement the example dataset generators of FIGS. 1 and 2.

FIG. 4 is a graph of example nominal data.

FIG. 5 shows graphs of example test results data using an original training dataset.

FIG. 6 shows graphs of example test results data using an example anomaly detection dataset generated in accordance with this disclosure.

FIG. 7 is a table of example performance results for the examples of FIGS. 5 and 6.

FIG. 8 is a flow diagram representing example processes that may be implemented as machine-readable instructions that may be executed to implement the example dataset generators of FIGS. 1 and 2 to generate anomaly detection datasets.

FIG. 9 illustrates an example processor system structured to execute the example instructions of FIG. 8 to implement the example dataset generators of FIGS. 1 and 2.

DETAILED DESCRIPTION

One of the key problems that prevents a wider adoption of machine learning is the lack of available datasets for training machine learning models. A dataset, which is a collection of (sometimes labeled) data, is needed to train a machine learning model for a particular problem domain. This is generally true regardless of whether the machine learning algorithm is supervised, semi-supervised, or unsupervised. Because of this, obtaining an appropriate dataset is usually one of the first steps in developing, training, validating, and testing machine learning models. It is widely accepted in the machine learning community that large datasets are needed to increase model classification accuracy and generalization, especially for deep learning.

The development of an appropriate machine learning model usually requires data of varying degrees, from simple to complex, to enable machine learning developers to continually refine and develop the machine learning model until it reaches an acceptable level of classification generalization. In practice, many machine learning models are generated through an iterative process of training and re-training on a growing dataset, which can be repeated numerous times before the model is ready for deployment. Because of this, manual manipulation of an existing dataset into smaller and larger datasets is often required, which increases engineering and research overhead.

Anomaly detection adds even more complexity to dataset generation and identification. Anomalies are, by definition, infrequent or rare, and therefore building (e.g., training) an accurate anomaly detection model can be challenging due to the scarcity of anomalous data and events in datasets. Further, anomalies tend to be continuous events, which means data presented for them must usually be in a time-series ordered form. Conventional wisdom is that a dataset needs to be 100 times the size of the machine learning model. For example, a small neural network today has 250,000 nodes, so a dataset with 25 million data inputs may be needed. Because anomalies are rare, even larger datasets may be required. Many available datasets contain few, if any, anomalies. For at least these reasons, datasets for anomaly detection are often prohibitively expensive, prohibitively time consuming to generate, unavailable, of insufficient depth, of insufficient size, etc.

Example dataset generators that overcome at least these problems, and that can easily, quickly and inexpensively generate large anomaly detection datasets for anomaly detection are disclosed herein. The example methods and apparatus disclosed herein can be inexpensively implemented, and can generate anomaly detection datasets that can, for example, have very large numbers and varieties of anomalies, have labelled anomalies, etc. A benefit of generating such large anomaly detection datasets is the ready ability to train deep neural networks for anomaly detection, where anomalous data tends to be scarce and the current state-of-the-art anomaly datasets are not as mature as datasets for other problem domains. Disclosed examples can generate anomalies that are short or long in duration, are discrete or continuous, or combinations thereof. The characteristics and/or occurrence of anomalies can be readily defined using probabilities, multi-variate functions, etc. In some examples, a user can define anomaly generation using a few simple, highly programmable functions defined by a small amount of source code or interpretable code. For instance, it has been advantageously discovered that simple functions (e.g., a random number generator) can be used to generate anomalies that can be used to train a machine learning model to detect real world anomalies. In some examples, real world anomalies are anomalies that can, do, or may occur during actual usage of a machine learning model. In contrast, in some examples, anomalies generated herein are generated using a random generation function. Such simple functions are inexpensive and efficient to implement. Some examples reduce or eliminate replicated or repeated data that can cause overfitting, a negative side-effect in machine learning training that can degrade a model's classification correctness and generalizability.

The disclosed example dataset generators can be implemented using large numbers of processor cores. In some examples, functions are executed by the processor cores in parallel and fully independently, without need for inter-communication, dependencies, memory sharing, data sharing, memory synchronization serialization overhead, etc. In the industry, such examples are sometimes referred to as embarrassingly parallel systems. Such examples can enable a dramatic increase in the size of and the speed at which datasets can be generated and subsequently used, in practice, to train machine learning models to detect anomalies. Accordingly, machine learning researchers and practitioners can begin testing the generalization of a trained machine learning model within minutes, instead of hours or days or weeks as may be necessary with the existing datasets.

While the examples disclosed herein are described with reference to anomaly detection and/or anomaly detection datasets, it should be understood that the examples disclosed herein may be applied to generate datasets for other problem domains. Moreover, while examples of anomaly generation are disclosed herein, it should be understood that anomalies may be generated in other ways.

FIG. 1 is a block diagram of an example machine learning system 100. To generate anomaly data, the example machine learning system 100 of FIG. 1 includes an example dataset generator 110, in accordance with this disclosure. The example dataset generator 110 of FIG. 1 executes one or more example user-defined generation functions 120 to generate an anomaly detection dataset 130. The example user-defined generation functions 120 generate nominal (e.g., normal, typical, average, etc.) data, and anomaly data. In some examples, the dataset generator 110 generates the nominal data from nominal data input(s) 140 in the form of, for example, sample nominal signal waveforms, a prior nominal dataset, etc. for a problem domain. In some examples, the dataset generator 110 generates the anomaly data from anomaly data input(s) 142 in the form of, for example, sample anomaly signal waveforms, a prior anomaly dataset, etc. for a problem domain. In some examples, nominal data and anomaly data are data slices (e.g., a time sequence of data values) that are spliced together (e.g., arranged in sequence, interspersed, etc.) by the dataset generator 110 to form data that can be used to train a machine learning model for anomaly detection. In some examples, the user-defined generation functions 120 can be defined by a user in the form of simple, highly programmable functions defined by a small amount of simple source code or interpretable code that can be executed to implement the user-defined generation functions 120.

Repetition of data in the example anomaly detection dataset 130 can result in model overfitting, wherein a machine learning model models random error or noise instead of underlying relationships. Machine learning models are often more susceptible to overfitting due to replicated anomaly data than nominal data. Nominal data may not have the same sensitivity to overfitting as anomaly data due to its larger size, and because it may already span a broad range of values due, again, to its larger size compared to anomaly data. Accordingly, in some examples, a user-defined generation function 120 includes one or more uniqueness parameters to control (e.g., limit, restrict, etc.) the repetition of data. Example uniqueness parameters set hard or soft limits on data repetition. In some examples, data in the anomaly detection dataset 130 is labeled. Relative to anomaly detection, data may be labeled “nominal” or “anomalous.”

In some examples, multiple instances of the user-defined generation functions 120 are executed in parallel on any number and/or type(s) of example processor cores 150. In some examples, the multiple instances are executed by an embarrassingly parallel system, making the generation of large datasets having, for example, tens or hundreds of millions of data inputs time efficient and cost effective. This, in turn, makes the implementation of deep learning, and the use of huge datasets (so-called “large data”) for problem domains of increasing complexity, technically and economically feasible.
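As an illustration only, the following Python sketch shows one way such independent execution might look. The names `generate_slice` and `generate_all`, the task tuple layout, and the per-task seeding are assumptions made for this example and do not appear in FIG. 3. Each task carries everything it needs, so tasks share no state and can be dispatched to any worker in any order.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_slice(task):
    """Generate one data slice from its own (seed, start, length) task tuple."""
    seed, start, length = task
    rng = random.Random(seed)  # private generator: no state shared between tasks
    return [(start + i, rng.random()) for i in range(length)]

def generate_all(tasks, mapper=map):
    """Run every task through `mapper`; tasks never communicate with each other."""
    return list(mapper(generate_slice, tasks))

# Hypothetical workload: eight slices of 100 timestamps each.
tasks = [(seed, seed * 100, 100) for seed in range(8)]

serial = generate_all(tasks)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = generate_all(tasks, mapper=pool.map)
```

Because each slice depends only on its own task tuple, the serial and pooled runs produce identical data. A thread pool is used here to keep the sketch portable; passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` instead would spread the same tasks over multiple processor cores.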

A processor core 150 may be implemented by, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), etc.

The example anomaly detection dataset 130 of FIG. 1 can be used to train an example machine learning model 160 in the form of, for example, a neural network. The neural network may have any number of nodes in any topology. The machine learning model 160 may be trained using any number and/or type(s) of algorithm(s), method(s), process(es), etc.

To test the performance of the machine learning model 160, the example machine learning system 100 includes an example performance analyzer 170. The example performance analyzer 170 of FIG. 1 compares the labels of test inputs of the machine learning model 160 with output classifications of “nominal” or “anomalous” of the machine learning model 160 for the test inputs. The performance analyzer 170 determines performance metrics for the machine learning model 160 based on the comparisons.
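A minimal sketch of such a comparison is given below, assuming the labels and classifications are available as parallel lists of the strings “nominal” and “anomalous”. The function name `classification_metrics` and the choice of metrics are illustrative assumptions; the disclosure does not specify how the performance analyzer 170 is implemented.

```python
def classification_metrics(labels, predictions, positive="anomalous"):
    """Compare true labels with model outputs and derive standard metrics."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test labels and model outputs.
labels = ["nominal", "anomalous", "anomalous", "nominal"]
outputs = ["nominal", "anomalous", "nominal", "anomalous"]
metrics = classification_metrics(labels, outputs)
```

Treating “anomalous” as the positive class means recall measures the fraction of true anomalies that were detected, and precision measures the fraction of declared anomalies that were real.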

FIG. 2 is a block diagram of an example dataset generator 200 that may be used to implement the example dataset generator 110 of FIG. 1. To generate nominal data, the example dataset generator 200 of FIG. 2 includes an example nominal data generator 202. To generate anomaly data, the example dataset generator 200 of FIG. 2 includes an example anomaly data generator 204. Because the example user-defined generation functions 120 can be defined separately for anomaly data and nominal data, the example nominal data generator 202 and the example anomaly data generator 204 can execute user-defined generation functions 120 in parallel. Moreover, the user-defined generation functions 120 allow the anomaly data and/or the nominal data for different time periods to be generated in parallel. In the example of FIG. 2, the example nominal data generator 202 and the example anomaly data generator 204 are implemented identically, and are differentiated based on provided control inputs and/or user-defined generation functions. For example, by providing different user-defined generation functions to the nominal data generator 202 and the anomaly data generator 204, they can implement different functionality. However, the nominal data generator 202 and the anomaly data generator 204 may be implemented differently.

To receive, from a user, definitions of user-defined generation functions (e.g., the example user-defined generation functions 120), the example dataset generator 200 of FIG. 2 includes an example application programming interface (API) 206. The example API 206 of FIG. 2 enables a user to manually or programmatically provide, enter, upload, manage, etc. the definition of generation functions for nominal data and/or anomaly data. In some examples, generation functions are defined and provided to the API 206 in the form of one or more files containing source code or interpretable code. Additionally, and/or alternatively, generation functions may be provided via the API 206 as executable files. Example user-defined generation functions are discussed below in connection with FIG. 3.

In addition to the user-defined generation functions 120, the example API 206 enables a user to manually or programmatically provide, enter, upload, manage, etc. generation specifications 208 for the nominal data and/or anomaly data to be generated. Example generation specifications 208 include the number of entries (e.g., data slices) to be generated, one or more rules regarding uniqueness (e.g., do not allow, allow after an amount of time has passed, etc.), one or more random variables, one or more function variables, the period(s) of time of the nominal data to be generated, the period(s) of time of anomalous data to be generated, and the likelihood or probability of anomalies (e.g., 0.5% of the time). Example generation specifications 208 are discussed below in connection with FIG. 3. In some examples, the generation specifications 208 are part of the user-defined generation functions 120.

To define the data to generate, the example dataset generator 200 includes an example scheduler 210. The example scheduler 210 of FIG. 2, based on the generation specifications 208, determines a timeline of data to be generated, which includes the type(s) of data that will be created for each period (e.g., nominal or anomalous), the duration of each data slice, and their sequential ordering. For each of the nominal data generator 202 and the anomaly data generator 204, the example scheduler 210 creates (e.g., spawns, instantiates, etc.) an instance of an example data generator 212, and identifies the applicable portion(s) of the timeline of data to be generated to the created data generator 212. For instance, the portion(s) of the timeline relating to anomaly data generation is passed to the data generator 212 of the anomaly data generator 204.
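One possible way to build such a timeline is sketched below. The names `Slice` and `build_timeline`, the per-slice random draw, and the integer timestamps are assumptions for illustration; the disclosure does not state how the scheduler 210 chooses slice types or durations. The anomaly probability is expressed as "one in N", mirroring the integral anomaly probability parameter described in connection with FIG. 3.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Slice:
    kind: str   # "nominal" or "anomalous"
    start: int  # first timestamp of the slice (inclusive)
    end: int    # one past the last timestamp (exclusive)

def build_timeline(start_time, end_time, nominal_bounds, anomaly_bounds,
                   anomaly_one_in, seed=0):
    """Return contiguous slices covering [start_time, end_time)."""
    if min(nominal_bounds[0], anomaly_bounds[0]) < 1 or anomaly_one_in < 1:
        raise ValueError("durations and anomaly_one_in must be at least 1")
    rng = random.Random(seed)
    timeline, t = [], start_time
    while t < end_time:
        # Each slice is anomalous with probability 1/anomaly_one_in.
        is_anomaly = rng.randrange(anomaly_one_in) == 0
        low, high = anomaly_bounds if is_anomaly else nominal_bounds
        end = min(t + rng.randint(low, high), end_time)  # clip the last slice
        timeline.append(Slice("anomalous" if is_anomaly else "nominal", t, end))
        t = end
    return timeline

timeline = build_timeline(0, 1000, nominal_bounds=(10, 50),
                          anomaly_bounds=(2, 8), anomaly_one_in=5, seed=1)
anomaly_portion = [s for s in timeline if s.kind == "anomalous"]
nominal_portion = [s for s in timeline if s.kind == "nominal"]
```

The two filtered lists correspond to the portions of the timeline that would be handed to the anomaly and nominal data generators, respectively.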

The example data generators 212 of FIG. 2 create (e.g., spawn, instantiate, etc.) one or more example data generation instances 214, which are instances of applicable ones of the user-defined generation functions 120. For instance, the example data generator 212 of the anomaly data generator 204 creates data generation instances 214 of an anomaly generation function. The data generators 212 pass to each of their data generation instances 214 a time period of data to create, including all the necessary variable data needed to create the complete intended data for the time period. In some examples, the number of data generation instances 214 created by a data generator 212 is a heuristic based on several factors, such as the number of processor cores 150 available, the amount of the data to be created, the supplied uniqueness rules, probability of duplicate data entries, computational overhead to find and replace each duplicate, the total number of data values to be created for the thread, etc.

The example data generation instances 214 of FIG. 2 generate anomaly or nominal data based on the generation specifications 208 provided by the user, including calling into user-defined function variable functions, and the time period specified by its data generator 212. Each data generation instance 214 stores the data slices it generates in its own map of timestamps and the data. If duplicate data is generated within a data generation instance's map, and unique values are required as specified by the user, the data generation instance 214 will discard the duplicate data and generate new data for that time period. The data generation instances 214 can be executed in serial and/or parallel on the one or more processor cores 150. In some examples, the data generation instances 214 are executed by an embarrassingly parallel system.
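The discard-and-regenerate behavior can be sketched as follows. The function name `run_instance`, the `require_unique` flag, and the retry cap are hypothetical and are introduced only to keep the example self-contained; a dictionary keyed by timestamp stands in for the instance's map.

```python
import random

def run_instance(timestamps, generate_value, require_unique=True,
                 max_retries=100):
    """Fill a private timestamp -> value map, regenerating duplicate values."""
    data, seen = {}, set()
    for ts in timestamps:
        value = generate_value(ts)
        retries = 0
        while require_unique and value in seen:
            if retries >= max_retries:
                raise RuntimeError(f"no unique value found for timestamp {ts}")
            value = generate_value(ts)  # discard the duplicate and try again
            retries += 1
        seen.add(value)
        data[ts] = value
    return data

# Hypothetical anomaly generator drawing integers from a small range.
rng = random.Random(42)
anomaly_map = run_instance(range(10), lambda ts: rng.randint(0, 999))
```

The retry cap is a design choice of this sketch: without it, a generator whose value range is smaller than the number of requested timestamps would loop forever when uniqueness is required.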

To combine the data generated by the example data generation instances 214, the example nominal data generator 202 and the example anomaly data generator 204 include a respective example data merger 216. The example data mergers 216 of FIG. 2 collect from the data generation instances 214 the data slices generated by the data generation instances 214 into, for example, a single map or data structure.

To remove duplicate data in the data collected by the example data mergers 216, the example nominal data generator 202 and the example anomaly data generator 204 include a respective example data replication manager 218. The data replication managers 218 eliminate duplicate data entries that may have been independently generated by two or more of the data generation instances 214. When duplicates are found, the data replication managers 218 handle the duplicate data as specified by uniqueness rules specified by the user.
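The collection and cross-instance duplicate handling might be sketched as follows. Regenerating the later duplicate is only one assumed uniqueness rule among those a user could specify, and the names `merge_instance_maps`, `enforce_uniqueness`, and `regenerate` are hypothetical.

```python
def merge_instance_maps(instance_maps):
    """Collect per-instance timestamp -> value maps into one map."""
    merged = {}
    for instance_map in instance_maps:
        merged.update(instance_map)  # instances cover disjoint time periods
    return merged

def enforce_uniqueness(merged, regenerate, max_retries=100):
    """Replace values that repeat an earlier timestamp's value."""
    result, used = dict(merged), set()
    for ts in sorted(result):
        value, retries = result[ts], 0
        while value in used:
            if retries >= max_retries:
                raise RuntimeError(f"no unique value found for timestamp {ts}")
            value = regenerate(ts)
            retries += 1
        used.add(value)
        result[ts] = value
    return result

# Two instances independently produced the value 5.
first, second = {0: 5, 1: 7}, {2: 5, 3: 9}
merged = merge_instance_maps([first, second])
unique = enforce_uniqueness(merged, regenerate=lambda ts: ts + 100)
```

Walking the timestamps in sorted order keeps the earliest occurrence of each value and replaces only later ones, which makes the outcome deterministic regardless of which instance finished first.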

To combine the nominal data generated by the example nominal data generator 202 with the anomaly data generated by the example anomaly data generator 204, the example dataset generator 200 includes an example dataset merger 220. The example dataset merger 220 combines the nominal data and the anomaly data by splicing the nominal data slices with the anomaly data slices based on the time-series periods associated with pieces of the anomaly data and the nominal data. The time-series ordered data is stored in the example anomaly detection dataset 130 in a format specified by the user.
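A minimal sketch of such splicing is shown below, assuming each generator delivers a map from timestamp to value and that the output is a list of labeled rows. The function name `splice` and the row layout are illustrative assumptions, not a format defined by this disclosure.

```python
def splice(nominal, anomaly):
    """Interleave nominal and anomaly values into time-ordered labeled rows."""
    overlap = nominal.keys() & anomaly.keys()
    if overlap:
        raise ValueError(f"timestamps assigned twice: {sorted(overlap)}")
    rows = [(ts, value, "nominal") for ts, value in nominal.items()]
    rows += [(ts, value, "anomalous") for ts, value in anomaly.items()]
    rows.sort(key=lambda row: row[0])  # order by timestamp only
    return rows

# Hypothetical slices: timestamps 2-3 were scheduled as an anomaly period.
nominal_data = {0: 0.0, 1: 0.1, 4: 0.4}
anomaly_data = {2: 9.0, 3: 8.0}
dataset = splice(nominal_data, anomaly_data)
```

Because the scheduler assigns each time period to exactly one generator, combining the outputs reduces to ordering by timestamp; the overlap check simply guards that assumption.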

While example implementations of the example dataset generator 110, the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generation instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 are shown in FIGS. 1 and 2, one or more of the elements, processes and/or devices illustrated in FIGS. 1 and 2 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example dataset generator 110, the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generation instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 of FIGS. 1 and 2 may be implemented by hardware, software, firmware and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example dataset generator 110, the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generation instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), GPU(s), DSP(s), ASIC(s), PLD(s), and/or FPLD(s). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example dataset generator 110, the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generation instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 is/are hereby expressly defined to include a tangible computer-readable storage medium storing the software and/or firmware. Further still, the example dataset generator 110, the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generation instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 of FIGS. 1 and 2 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIGS. 1 and 2, and/or may include more than one of any or all of the illustrated elements, processes and devices.

FIG. 3 is a listing of example source code 300 that may be used to implement the example user-defined generation functions 120 and/or the example generation specifications 208. The example source code 300 of FIG. 3 includes an example user-defined generation function 302 to generate nominal data, and an example user-defined generation function 304 to generate anomaly data. The example user-defined generation function 302 of FIG. 3 generates sinusoidal nominal data. The example user-defined generation function 304 of FIG. 3 generates random anomaly data. In some examples, a user-defined generation function for nominal data creates copies of one or more nominal data inputs 140 (see FIGS. 1 and 2). In some examples, a user-defined generation function for anomaly data creates copies of one or more anomaly data inputs 142 (see FIGS. 1 and 2). The example function 302 of FIG. 3 generates a nominal data value at a specified timestamp, and returns the generated nominal data value. The example function 304 of FIG. 3 generates an anomaly data value at a specified timestamp, and returns the generated anomaly data value. While example functions 302 and 304 are shown in FIG. 3, user-defined generation functions may be created to generate nominal and anomaly data in any way.
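The listing of FIG. 3 is not reproduced in this text. As a hedged illustration of the same idea, the Python sketch below defines a sinusoidal nominal function and a random anomaly function, each taking a timestamp and returning one value. The names `nominal_value` and `anomaly_value`, the period, the amplitude, the value range, and the seeding scheme are all assumptions of this sketch rather than details of the functions 302 and 304.

```python
import math
import random

def nominal_value(timestamp, period=50.0, amplitude=1.0):
    """Return the sinusoidal nominal value at the given timestamp."""
    return amplitude * math.sin(2.0 * math.pi * timestamp / period)

def anomaly_value(timestamp, seed=0, low=-3.0, high=3.0):
    """Return a random anomaly value that depends only on (seed, timestamp)."""
    # Deriving the generator from the timestamp keeps calls independent,
    # so any worker can compute any timestamp without shared state.
    rng = random.Random(seed * 1_000_003 + timestamp)
    return rng.uniform(low, high)

nominal_samples = [nominal_value(t) for t in range(100)]
anomaly_samples = [anomaly_value(t) for t in range(100)]
```

Both functions are pure functions of their arguments, which is what allows values for different timestamps to be generated in any order or on different processor cores.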

As demonstrated below in connection with FIGS. 4-7, it has been advantageously discovered that simple anomaly generation functions, such as the example random function of the user-defined generation function 304, can be used to generate anomalies that can be used to train a machine learning model to detect real world anomalies. Such simple functions are inexpensive and efficient to implement. The use of such user-defined generation functions can be simpler than capturing sample real world anomalies, defining specific anomalies, etc.

To store information regarding the functions 302 and 304, the example source code 300 includes an example function variable 306. The example function variable 306 includes a name “sine-wave” and pointers to the example functions 302 and 304.

To store data generation configuration information, the example source code 300 of FIG. 3 includes another example object 308. The example object 308 of FIG. 3 is created based on the following example parameters 310:

- start_time: specifies the start time for the dataset
- end_time: specifies the end time for the dataset
- nominal_lowerbound_duration: specifies the minimum duration for a nominal data slice
- normal_upperbound_duration: specifies the maximum duration for a nominal data slice
- anomaly_lowerbound_duration: specifies the minimum duration for an anomaly data slice
- anomaly_upperbound_duration: specifies the maximum duration for an anomaly data slice
- anomaly_probability: specifies the probability of an anomaly occurring. In the example of FIG. 3, the anomaly probability is specified as an integral value and the probability is computed as 1/<specified-integral-value>.

The example object 308 of FIG. 3 includes a method 312 to add the function variable 306, a method 314 executed by the scheduler 210 and data generators 212 to generate the nominal and anomaly time slices, and a method 316 executed by the data mergers 216 to collect the data generated by the example data generation instances 214.
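A configuration object of this kind could be sketched as shown below. The field names follow the parameter list above (including its spelling of `normal_upperbound_duration`), while the class names `FunctionVariable` and `GenerationConfig`, the validation checks, and the `anomaly_chance` property are assumptions of this sketch and are not taken from FIG. 3.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class FunctionVariable:
    name: str
    nominal_fn: Callable[[float], float]
    anomaly_fn: Callable[[float], float]

@dataclass
class GenerationConfig:
    start_time: int
    end_time: int
    nominal_lowerbound_duration: int
    normal_upperbound_duration: int  # spelled as in the parameter list
    anomaly_lowerbound_duration: int
    anomaly_upperbound_duration: int
    anomaly_probability: int         # integral N; the chance is 1/N
    function_variables: List[FunctionVariable] = field(default_factory=list)

    def __post_init__(self):
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be after start_time")
        if self.anomaly_probability < 1:
            raise ValueError("anomaly_probability must be a positive integer")

    @property
    def anomaly_chance(self) -> float:
        """Probability of an anomaly, computed as 1/<integral value>."""
        return 1.0 / self.anomaly_probability

    def add_function_variable(self, variable: FunctionVariable) -> None:
        self.function_variables.append(variable)

config = GenerationConfig(0, 1000, 10, 50, 2, 8, anomaly_probability=200)
config.add_function_variable(
    FunctionVariable("sine-wave", lambda t: 0.0, lambda t: 1.0))
```

With an integral value of 200, the computed chance is 1/200, which matches the 0.5% figure used as an example likelihood earlier in this description.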

Example real world performance improvements that may be obtained using the teachings of this disclosure will now be described in connection with FIGS. 4-7. FIG. 4 is a graph of an example real world nominal signal 400 for the Space Shuttle.

FIG. 5 illustrates example anomaly detection results using conventional training data. FIG. 5 includes an example graph 502 including a portion of example conventional training data taken from a publicly available dataset having manually labeled anomaly and nominal data. FIG. 5 also includes an example graph 504 including a portion of example test data for the example original training data. Using a long short-term memory (LSTM) neural network, another example graph 506 of FIG. 5 shows anomaly detection probabilities for the example test data of graph 504, and yet another graph 508 shows rounded anomaly detection probabilities. An anomaly is detected when a value of the graph 508 is one.
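The rounding step can be illustrated with a short sketch. The function name `detect_anomalies` and the explicit 0.5 threshold are assumptions; the disclosure states only that the probabilities are rounded. An explicit comparison is used because Python's built-in `round` rounds exact halves to the nearest even integer, so `round(0.5)` yields 0.

```python
def detect_anomalies(probabilities, threshold=0.5):
    """Map each anomaly probability to 1 (anomaly detected) or 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical model outputs for five test inputs.
probabilities = [0.10, 0.50, 0.93, 0.49, 0.75]
flags = detect_anomalies(probabilities)
detected_at = [i for i, flag in enumerate(flags) if flag == 1]
```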

FIG. 6 illustrates example anomaly detection results showing example improvements obtained by using an anomaly detection dataset according to this disclosure. In FIG. 6, the same LSTM neural network is trained using training data 602 that has been generated according to this disclosure rather than the conventional data of graph 502. After training the LSTM neural network with the training data 602, according to this disclosure, rather than the original training data 502, the LSTM neural network is tested with the same test data 504. Comparing the graphs 506 and 508 of FIG. 5 with respective example graphs 604 and 606 of FIG. 6 shows that the LSTM neural network detects more anomalies in the test data 504 using the training data 602, according to this disclosure, rather than the original training data 502.

FIG. 7 is a table comparing the example results of FIGS. 5 and 6. The results of FIG. 7 demonstrate a 4.3% improvement in overall accuracy, a 21× improvement in recall, and a 19× improvement in F1 score. An F1 score is a statistical measure of accuracy for binary classification systems. An F1 score reflects a balanced metric of precision and recall. Thus, a large improvement in an F1 score reflects meaningful real world performance improvement. The decrease in precision shown in FIG. 7 reflects a bias in the original results toward only declaring an anomaly when there is high probability of anomaly. That is, the original test results were biased against making an error. This is also reflected in the very low recall value.

FIG. 8 is a flow diagram representative of example process(es) that may be implemented as coded computer-readable instructions. The coded instructions may be executed to implement the dataset generators 110 and 200 of FIGS. 1 and 2 to generate anomaly detection datasets. In this example, the coded instructions comprise one or more programs for execution by a processor such as the processor 912 shown in the example processor platform 900 discussed below in connection with FIG. 9. The program(s) may be embodied in the coded instructions and stored on one or more tangible computer-readable storage mediums associated with the processor 912. One or more of the program(s) and/or parts thereof could alternatively be executed by a device other than the processor 912. One or more of the programs may be embodied in firmware or dedicated hardware. Further, although the example process(es) is/are described with reference to the flowchart illustrated in FIG. 8, many other methods of implementing the example dataset generators 110 and 200 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined.

As mentioned above, the example process(es) of FIG. 8 may be implemented using coded instructions (e.g., computer-readable instructions and/or machine-readable instructions) stored on one or more tangible computer-readable storage mediums. As used herein, the term tangible computer-readable storage medium is expressly defined to include any type of computer-readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media. As used herein, “tangible computer-readable storage medium” and “tangible machine-readable storage medium” are used interchangeably. Additionally, or alternatively, the example process(es) of FIG. 8 may be implemented using coded instructions (e.g., computer-readable instructions and/or machine-readable instructions) stored on one or more non-transitory computer-readable storage mediums. As used herein, the term non-transitory computer-readable storage medium is expressly defined to include any type of computer-readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media. As used herein, “non-transitory computer-readable storage medium” and “non-transitory machine-readable storage medium” are used interchangeably.

Example tangible computer-readable storage mediums include, but are not limited to, any tangible computer-readable storage device or tangible computer-readable storage disk such as a memory associated with a processor, a memory device, a flash drive, a digital versatile disk (DVD), a compact disc (CD), a Blu-ray disk, a floppy disk, a hard disk drive, a random access memory (RAM), a read-only memory (ROM), etc. and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information).

The example process of FIG. 8 includes the example API 206 receiving definitions of user-defined generation functions (block 802), and the example scheduler 210 defining the data slices for the nominal data generator 202 and the anomaly data generator 204 to generate (block 804). For the example of FIG. 6, an example scheduler 210 specified an anomaly detection dataset as repetitions of the example real world nominal signal 400 for the Space Shuttle, with portions that are replaced with anomaly data. Thus, portions of the repeating nominal signal (nominal data slices) are time interspersed, spliced, etc. together with anomaly data slices. In FIG. 6, example time periods 610, 611, 612, 613, 614, 615 are assigned to the anomaly data generator 204 for the generation of anomaly data slices (e.g., blocks of values, string of values, etc.) that are to be placed in those time periods 610-615. Intervening time periods (two examples of which are designated at reference numbers 620 and 621) are assigned to the nominal data generator 202 for generation of repetitions of the nominal input signal 140. The scheduler 210 can specify the time stamps corresponding to the time periods 610-615 to the anomaly data generator 204, and the time stamps corresponding to the time periods 620-621 to the nominal data generator 202.

The example data generator instances 214 of the nominal data generator 202 generate nominal data for the nominal time periods 620-621 (block 806), and the data replication manager 218 handles data replication, if any (block 808). The example data generator instances 214 of the anomaly data generator 204 generate anomaly data for the anomaly time periods 610-615 (block 810), and the data replication manager 218 handles data replication, if any (block 812).

The example dataset merger 220 combines the generated nominal data slices and the generated anomaly data slices to form the anomaly detection dataset 130 (block 814). For example, the dataset merger 220 creates an ordered sequence including an anomaly data slice generated for the time period 610, a nominal data slice generated for the time period 620, another anomaly data slice generated for the time period 611, etc. A machine learning model is trained with the anomaly detection dataset 130 (block 816), and control exits from the example process of FIG. 8.
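As a hedged sketch of the ordering step described above, the following assumes each data slice is represented as a tuple of a start time stamp, a kind, and a list of values. That representation, the `merge_slices` name, and the per-sample 0/1 labels are assumptions made for illustration, not details taken from the dataset merger 220.

```python
from typing import List, Tuple

# (start timestamp, "nominal" or "anomaly", values) -- hypothetical layout
Slice = Tuple[int, str, List[float]]


def merge_slices(nominal: List[Slice],
                 anomaly: List[Slice]) -> Tuple[List[float], List[int]]:
    """Combine slices by ordering them on their start timestamps.

    Returns the concatenated data and a per-sample label list
    (1 for anomaly samples, 0 for nominal samples).
    """
    ordered = sorted(nominal + anomaly, key=lambda s: s[0])
    data: List[float] = []
    labels: List[int] = []
    for _start, kind, values in ordered:
        data.extend(values)
        labels.extend([1 if kind == "anomaly" else 0] * len(values))
    return data, labels


nominal_slices = [(2, "nominal", [0.0, 0.0])]
anomaly_slices = [(0, "anomaly", [9.0, 9.0]), (4, "anomaly", [7.0])]
dataset, dataset_labels = merge_slices(nominal_slices, anomaly_slices)
```

Because the slices do not depend on one another, the combination in this sketch is a sort followed by concatenation.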

Because, in FIG. 6, a simple random function, such as the example user-defined generation function 304, is used to generate the anomaly data slices, the anomaly data slices can be generated fully independently from each other, and from the generation of nominal data slices. Likewise, because the nominal data slices are portions of a repeating signal, they can be generated fully independently from each other from a single timestamp. Because the nominal and anomaly data slices for the time periods 604-611 are, thus, fully independent, nominal data slices and anomaly data slices can be generated by a plurality of processor cores executing in parallel and fully independently. Because the data slices are independent, there is no computational burden to combine the data slices. Instead, they may be combined by simple data ordering, as shown herein. Accordingly, very large (e.g., >100 million data inputs) anomaly detection datasets can be easily, quickly and inexpensively defined and generated.
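The independence property described above can be illustrated with a minimal sketch. Everything named here (`nominal_slice`, `anomaly_slice`, `generate_slice`, `generate`, the sine-wave nominal signal, the uniform random anomaly values, and the per-slice seeding scheme) is an assumption chosen for illustration rather than the user-defined generation function 304 itself. A thread pool from the Python standard library stands in for the plurality of processor cores.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Assignment = Tuple[int, int, str]  # (start, end, "nominal" or "anomaly")


def nominal_slice(start: int, end: int, period: int = 8) -> List[float]:
    # A repeating signal: each value depends only on its own timestamp,
    # so any slice can be produced without generating earlier slices.
    return [math.sin(2 * math.pi * (t % period) / period)
            for t in range(start, end)]


def anomaly_slice(start: int, end: int, seed: int = 0) -> List[float]:
    # Seed a private generator from the slice's start timestamp so the
    # result does not depend on which worker runs it, or in what order.
    rng = random.Random(seed * 1_000_003 + start)
    return [rng.uniform(-5.0, 5.0) for _ in range(start, end)]


def generate_slice(assignment: Assignment) -> List[float]:
    start, end, kind = assignment
    if kind == "nominal":
        return nominal_slice(start, end)
    return anomaly_slice(start, end)


def generate(assignments: List[Assignment], workers: int = 4) -> List[float]:
    # Executor.map returns results in input order, so combining the
    # slices is plain concatenation.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        slices = list(pool.map(generate_slice, assignments))
    return [value for data_slice in slices for value in data_slice]


assignments = [(0, 4, "nominal"), (4, 6, "anomaly"), (6, 10, "nominal")]
result = generate(assignments, workers=3)
```

Because `generate_slice` is a module-level function with no shared state, a process pool (e.g., `concurrent.futures.ProcessPoolExecutor`) could be substituted for the thread pool to spread the work across separate processor cores.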

FIG. 9 is a block diagram of an example processor platform 900 configured to execute the process(es) of FIG. 8 to implement the example dataset generators 110 and 200. The processor platform 900 can be, for example, a server, a personal computer, or any other type of computing device.

The processor platform 900 of the illustrated example includes a processor 912. The processor 912 of the illustrated example is hardware. For example, the processor 912 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, or controllers from any desired family or manufacturer.

In the illustrated example, the processor 912 implements the example dataset generator 200, the example nominal data generator 202, the example anomaly data generator 204, the example API 206, the example scheduler 210, the example data generators 212, the example data generator instances 214, the example data mergers 216, the example data replication managers 218, and the example dataset merger 220 described above in connection with FIG. 2.

The processor 912 of the illustrated example includes a local memory 913 (e.g., a cache). The processor 912 of the illustrated example is in communication with a main memory including a volatile memory 914 and a non-volatile memory 916 via a bus 918. The volatile memory 914 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory (RAM) device. The non-volatile memory 916 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 914, 916 is controlled by a memory controller.

The processor platform 900 of the illustrated example also includes an interface circuit 920. The interface circuit 920 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.

In the illustrated example, one or more input devices 922 are connected to the interface circuit 920. The input device(s) 922 permit(s) a user to enter data and commands into the processor 912. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.

One or more output devices 924 are also connected to the interface circuit 920 of the illustrated example. The output devices 924 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube display (CRT), a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 920 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip or a graphics driver processor.

The interface circuit 920 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 926 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).

The processor platform 900 of the illustrated example also includes one or more mass storage devices 928 for storing software and/or data. Examples of such mass storage devices 928 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID systems, and digital versatile disk (DVD) drives.

Coded instructions 932 include the machine-readable instructions of FIG. 8 and may be stored in the mass storage device 928, in the volatile memory 914, in the non-volatile memory 916, and/or on a removable tangible computer readable storage medium such as a CD or DVD.

From the foregoing, it will be appreciated that methods, apparatus and articles of manufacture have been disclosed which enhance the operations of a computer by, among other things, providing anomaly detection datasets that can be used to train machine learning models that perform significantly better than models trained with conventional datasets. From the foregoing, it will be further appreciated that methods, apparatus and articles of manufacture have been disclosed which enhance the operations of a computer by, among other things, generating anomaly detection datasets that are larger, more robust, can be generated more cost effectively, can be generated more efficiently, can be generated by embarrassingly parallel systems, etc.

Example methods, apparatus, and articles of manufacture to generate anomaly detection datasets are disclosed herein. Further examples and combinations thereof include at least the following:

Example 1 is a method to generate an anomaly detection dataset for training a machine learning model to detect real world anomalies including receiving a user definition of an anomaly generator function, executing, with a processor, the anomaly generator function to generate user-defined anomaly data, and combining the user-defined anomaly data with nominal data to generate the anomaly detection dataset.

Example 2 includes the method of example 1, the user definition of the anomaly generator function including a uniqueness parameter to control data repetition.

Example 3 includes the method of example 1 or 2, further including, executing, with a plurality of processor cores, respective ones of a plurality of instances of the anomaly generator function, each of the plurality of instances to generate respective ones of a plurality of user-defined anomaly data slices.

Example 4 includes the method of example 3, where the plurality of instances are executed by the respective ones of the plurality of processor cores in parallel and independently.

Example 5 includes the method of any of examples 1 to 4, further including receiving a user definition of a nominal generator function, and executing, with a processor, the nominal generator function to generate the nominal data.

Example 6 includes the method of example 5, further including executing, with a first plurality of processor cores, respective ones of a plurality of instances of the anomaly generator function, each of the plurality of instances of the anomaly generator function to generate respective ones of a plurality of user-defined anomaly data slices, and executing, with a second plurality of processor cores, respective ones of a plurality of instances of the nominal generator function, each of the plurality of instances of the nominal generator function to generate respective ones of a plurality of nominal data slices.

Example 7 includes the method of example 6, further including splicing together the user-defined anomaly data slices and the nominal data slices.

Example 8 includes the method of example 6, wherein the plurality of instances of the anomaly generator function and the plurality of instances of the nominal generator function are executed in parallel and independently.

Example 9 includes the method of any of examples 1 to 8, further including training the machine learning model with the generated anomaly detection dataset, and testing real world anomaly detection with the trained machine learning model.

Example 10 includes the method of any of examples 1 to 9, wherein the user-defined anomaly data and the nominal data include respective data slices, and further including splicing the data slices together to combine the user-defined anomaly data with the nominal data.

Example 11 includes the method of any of examples 1 to 10, the user definition including source code, and further including compiling the source code.

Example 12 includes the method of any of examples 1 to 11, the user definition including an executable file.

Example 13 includes an apparatus comprising an interface to receive a user definition of an anomaly generator function, a processor core to execute the anomaly generator function to generate user-defined anomaly data, and a dataset merger to combine the user-defined anomaly data with nominal data to generate an anomaly detection dataset.

Example 14 includes the apparatus of example 13, further including a multitude of processor cores including the processor core, a plurality of instances of the anomaly generator function executing on the multitude of processor cores in parallel.

Example 15 includes the apparatus of example 13 or 14, further including a second processor core to execute a nominal generator function to generate the nominal data.

Example 16 includes the apparatus of example 15, wherein the processor core generates the user-defined anomaly data in parallel with the second processor core generating the nominal data.

Example 17 includes the apparatus of any of examples 13 to 16, wherein the anomaly data includes first data slices, the nominal data includes second data slices, and the dataset merger splices the first data slices with the second data slices to form the anomaly detection dataset.

Example 18 includes a non-transitory computer-readable storage medium comprising instructions that, when executed, cause a machine to at least perform receiving a user definition of an anomaly generator function, executing, with a processor, the anomaly generator function to generate user-defined anomaly data, and combining the user-defined anomaly data with nominal data to generate an anomaly detection dataset.

Example 19 includes the non-transitory computer-readable storage medium of example 18, wherein the instructions, when executed, cause the machine to further perform executing, with a first plurality of processor cores, respective ones of a plurality of instances of the anomaly generator function, each of the plurality of instances to generate respective ones of a plurality of user-defined anomaly data slices.

Example 20 includes the non-transitory computer-readable storage medium of example 19, wherein the instructions, when executed, cause the machine to further perform executing, with a second plurality of processor cores, respective ones of a plurality of instances of a nominal generator function, each of the plurality of instances of the nominal generator function to generate respective ones of a plurality of nominal data slices, and splicing together the user-defined anomaly data slices and the nominal data slices to form the anomaly detection dataset.

Example 21 includes a non-transitory computer-readable storage medium comprising instructions that, when executed, cause a machine to perform the method of any of examples 1 to 12.

Example 22 includes a system including means for receiving a user definition of an anomaly generator function, means for executing the anomaly generator function to generate user-defined anomaly data, and means for combining the user-defined anomaly data with nominal data to generate an anomaly detection dataset.

Example 23 includes the system of example 22, further including means for executing a plurality of instances of the anomaly generator function in parallel.

Example 24 includes the system of example 22 or 23, further including means for executing a nominal generator function to generate the nominal data.

Example 25 includes the system of example 24, wherein the means for generating the user-defined anomaly data operates in parallel with the means for generating the nominal data.

Example 26 includes the system of any of examples 22 to 25, wherein the anomaly data includes first data slices, the nominal data includes second data slices, and the means for combining splices the first data slices with the second data slices to form the anomaly detection dataset.

In this specification and the appended claims, the singular forms “a,” “an” and “the” do not exclude the plural reference unless the context clearly dictates otherwise. Further, conjunctions such as “and,” “or,” and “and/or” are inclusive unless the context clearly dictates otherwise. For example, “A and/or B” includes A alone, B alone, and A with B. Further, as used herein, when the phrase “at least” is used in this specification and/or as the transition term in a preamble of a claim, it is open-ended in the same manner as the term “comprising” is open ended.

Further, connecting lines or connectors shown in the various figures presented are intended to represent exemplary functional relationships and/or physical or logical couplings between the various elements. It should be noted that many alternative or additional functional relationships, physical connections or logical connections may be present in a practical device. Moreover, no item or component is essential to the practice of the embodiments disclosed herein unless the element is specifically described as “essential” or “critical”.

Terms such as, but not limited to, approximately, substantially, generally, etc. are used herein to indicate that a precise value or range thereof is not required and need not be specified. As used herein, the terms discussed above will have ready and instant meaning to one of ordinary skill in the art.

Although certain example methods, apparatuses and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. It is to be understood that terminology employed herein is for the purpose of describing particular aspects, and is not intended to be limiting. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.

What is claimed is:
 1. A non-transitory computer-readable storage medium comprising instructions that, when executed, cause a machine to at least: obtain a user definition of a function to generate a normal data series; determine a timeline of data based on the function; determine a time scale associated with anomaly generation; add anomaly data to the timeline of data based on the time scale; cause storage of the timeline as an anomaly detection dataset with the anomaly data in a data map including timestamps associated with the anomaly data; and train a machine learning classifier utilizing the anomaly detection dataset.
 2. The non-transitory computer-readable storage medium of claim 1, wherein the machine learning classifier is a deep-learned-based classifier.
 3. The non-transitory computer-readable storage medium of claim 1, wherein the instructions, when executed, cause the machine to generate random anomaly data.
 4. The non-transitory computer-readable storage medium of claim 1, wherein adding the anomaly data is based on an anomaly probability parameter.
 5. The non-transitory computer-readable storage medium of claim 1, wherein the instructions, when executed, cause the machine to generate anomaly data based on a user-defined function.
 6. The non-transitory computer-readable storage medium of claim 1, wherein the anomaly data and the normal data include respective data slices, wherein the instructions, when executed, cause the machine to splice the data slices together to combine the anomaly data with the normal data.
 7. The non-transitory computer-readable storage medium of claim 1, wherein the user definition of the function includes an executable file.
 8. An apparatus comprising: memory; computer executable instructions; programmable circuitry to execute the computer executable instructions to: obtain a user definition of a function to generate a normal data series; determine a timeline of data based on the function; determine a time scale associated with anomaly generation; add anomaly data to the timeline of data based on the time scale; cause storage of the timeline as an anomaly detection dataset with the anomaly data in a data map including timestamps associated with the anomaly data; and train a machine learning classifier utilizing the anomaly detection dataset.
 9. The apparatus of claim 8, wherein the machine learning classifier is a deep-learned-based classifier.
 10. The apparatus of claim 8, wherein the programmable circuitry is to execute the computer executable instructions to generate random anomaly data.
 11. The apparatus of claim 8, wherein adding the anomaly data is based on an anomaly probability parameter.
 12. The apparatus of claim 8, wherein the programmable circuitry is to execute the computer executable instructions to generate anomaly data based on a user-defined function.
 13. The apparatus of claim 8, wherein the anomaly data and the normal data include respective data slices, and wherein the programmable circuitry is to execute the computer executable instructions to splice the data slices together to combine the anomaly data with the normal data.
 14. The apparatus of claim 8, wherein the user definition of the function includes an executable file.
 15. A method comprising: obtaining a user definition of a function to generate a normal data series; determining a timeline of data based on the function; determining a time scale associated with anomaly generation; adding anomaly data to the timeline of data based on the time scale; causing storage of the timeline as an anomaly detection dataset with the anomaly data in a data map including timestamps associated with the anomaly data; and training a machine learning classifier utilizing the anomaly detection dataset.
 16. The method of claim 15, wherein the machine learning classifier is a deep-learned-based classifier.
 17. The method of claim 15, further comprising generating random anomaly data.
 18. The method of claim 15, wherein adding the anomaly data is based on an anomaly probability parameter.
 19. The method of claim 15, further comprising generating anomaly data based on a user-defined function.
 20. The method of claim 15, wherein the anomaly data and the normal data include respective data slices, further comprising splicing the data slices together to combine the anomaly data with the normal data.
 21. The method of claim 15, wherein the user definition of the function includes an executable file.